safety expert
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models
Sun, Yuhao, Xu, Zhuoer, Cui, Shiwen, Yang, Kun, Yu, Lingyun, Zhang, Yongdong, Xie, Hongtao
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that the safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Energy > Power Industry (1.00)
- (2 more...)
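The abstract above describes a router that gates between an original MLP and an added safety expert, modulated by a safety temperature at inference time. A minimal sketch of such a temperature-scaled gate is below; the function name, the two-way gating, and the choice to divide only the safety logit by the temperature are all assumptions for illustration, not the paper's implementation.

```python
import math

def route(router_logits, safety_temperature=1.0):
    """Toy two-way MoE gate (hypothetical sketch, not UpSafe-C's code).

    router_logits: [general_logit, safety_logit] produced by the router
    for one token. Lowering safety_temperature sharpens the safety
    expert's logit, biasing routing toward the safety expert; raising
    it relaxes the gate back toward the original MLP.
    Returns softmax gating weights [w_general, w_safety].
    """
    scaled = [router_logits[0], router_logits[1] / safety_temperature]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

Under this sketch, a low temperature makes the gate behave like a hard guardrail (safety expert dominates), while a high temperature recovers near-original behavior, which matches the abstract's described safety-utility trade-off knob.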
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation
Liu, Runtao, Chieh, Chen I, Gu, Jindong, Zhang, Jipeng, Pi, Renjie, Chen, Qifeng, Torr, Philip, Khakzar, Ashkan, Pizzati, Fabio
Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at https://safetydpo.github.io/.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > China > Hong Kong (0.04)
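SafetyDPO trains one LoRA safety expert per concept and then merges them into a single adapter. The paper's merging strategy is novel and not reproduced here; the sketch below only shows the generic idea of collapsing several per-concept low-rank updates into one by weighted averaging. The dict-of-deltas representation and the function name are illustrative assumptions.

```python
def merge_loras(experts, weights=None):
    """Hypothetical sketch of collapsing per-concept LoRA experts.

    experts: list of adapters, each a dict {param_name: delta_matrix},
             where delta_matrix is a list of rows (list of floats).
    weights: optional per-expert mixing weights (defaults to uniform).
    Returns one merged adapter as a weighted sum of the deltas.
    """
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)
    merged = {}
    for w, expert in zip(weights, experts):
        for name, delta in expert.items():
            acc = merged.setdefault(
                name, [[0.0] * len(delta[0]) for _ in delta])
            for i, row in enumerate(delta):
                for j, v in enumerate(row):
                    acc[i][j] += w * v
    return merged
```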
AI is overpowering efforts to catch child predators, experts warn
The volume of sexually explicit images of children being generated by predators using artificial intelligence is overwhelming law enforcement's capability to identify and rescue real-life victims, child safety experts warn. Prosecutors and child safety groups working to combat crimes against children say AI-generated images have become so lifelike that in some cases it is difficult to determine whether real children were subjected to real harms in their production. A single AI model can generate tens of thousands of new images in a short amount of time, and this content has begun to flood the dark web and seep into the mainstream internet. "We are starting to see reports of images that are of a real child but have been AI-generated, but that child was not sexually abused. But now their face is on a child that was abused," said Kristina Korobov, senior attorney at the Zero Abuse Project, a Minnesota-based child safety non-profit.
- North America > United States > Minnesota (0.25)
- North America > United States > Washington (0.05)
- North America > United States > California > Los Angeles County > Los Angeles (0.05)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
Child predators are using AI to create sexual images of their favorite 'stars': 'My body will never be mine again'
Predators active on the dark web are increasingly using artificial intelligence to create sexually explicit images of children, fixating especially on "star" victims, child safety experts warn. Child safety groups tracking the activity of predators chatting in dark web forums say they are increasingly finding conversations about creating new images based on older child sexual abuse material (CSAM). Many of these predators using AI obsess over child victims referred to as "stars" in predator communities for the popularity of their images. "The communities of people who trade this material get infatuated with individual children," said Sarah Gardner, chief executive officer of the Heat Initiative, a Los Angeles non-profit focused on child protection. "They want more content of those children, which AI has now allowed them to do."
- North America > United States > California > Los Angeles County > Los Angeles (0.25)
- Europe > United Kingdom (0.15)
- North America > United States > Wisconsin (0.05)
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Ghosh, Shaona, Varshney, Prasoon, Galinkin, Erick, Parisien, Christopher
As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLM models for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models, named AEGISSAFETYEXPERTS, not only surpass or perform competitively with the state-of-the-art LLM-based safety models and general purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.
- Europe > Germany (0.14)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China (0.04)
- Overview > Innovation (0.48)
- Research Report > Promising Solution (0.46)
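The AEGIS abstract describes online adaptation over an ensemble of safety experts with no-regret guarantees. A standard algorithm in that family is multiplicative weights (Hedge); the sketch below shows one Hedge update step as an illustration of the no-regret idea, with no claim that AEGIS uses this exact rule, learning rate, or loss encoding.

```python
import math

def hedge_update(weights, losses, eta=0.5):
    """One multiplicative-weights (Hedge) step -- a classic no-regret
    update, shown only as an example of the framework the abstract
    names. weights: current expert weights (sum to 1); losses:
    per-expert loss in [0, 1] observed this round; eta: learning rate.
    Experts that incurred loss are exponentially down-weighted.
    """
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]
```

Run over many rounds, this update guarantees that the ensemble's cumulative moderation loss approaches that of the best single safety expert in hindsight, which is the kind of guarantee "no-regret" refers to.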
Safety Analysis in the Era of Large Language Models: A Case Study of STPA using ChatGPT
Qi, Yi, Zhao, Xingyu, Khastgir, Siddartha, Huang, Xiaowei
Large Language Models (LLMs) [27], including Generative Pre-trained Transformer (GPT) [6] and Bidirectional Encoder Representations from Transformers (BERT) [13], have achieved state-of-the-art performance on a wide range of Natural Language Processing (NLP) tasks. LLMs are gaining popularity and receiving increasing attention for their significant applications in knowledge reasoning [12, 52, 57]. ChatGPT is one of the LLM applications, and probably the application, in the limelight. ChatGPT has been used for collating literature and writing professional papers in fields like law [9] and medical education [30, 16]. In March 2023, OpenAI announced GPT-4, which can pass exams ranging from the bar exam to AP Biology [39]. These success stories demonstrate that people have already gained experience in using LLMs; their massive training datasets and model capacity to process and learn from data enable them to handle complex tasks that require domain expert knowledge [38]. Given this, as researchers in the field of safety-critical systems, we pose a question: Can safety analysis make use of LLMs?
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > United Kingdom (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (4 more...)
- Workflow (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Energy > Power Industry (1.00)
- (3 more...)
Deepfake AI tech could assist and empower online predators, safety expert warns
Criminals are taking advantage of AI technology to conduct misinformation campaigns, commit fraud and obstruct justice through deepfake audio and video. Australia's eSafety Commission has raised concerns about the potential for artificial intelligence (AI) to assist predators in grooming children online as the country debates restrictions on the emerging technology. Australian eSafety Commissioner Julie Inman Grant posted on Twitter that "the manipulative power of generative AI to execute on grooming and sextortion is no longer speculative." "eSafety is already receiving cyberbullying reports and image-based abuse reports around deepfakes," she wrote. "The fact is AI has been 'exfiltrated into the wild' without guardrails."
Tesla's 'Full Self-Driving' Beta Software Used on Public Roads Lacks Safeguards
After Tesla released the latest prototype version of its driving assistance software last week, reports from owners have gained the attention of researchers and safety experts--both at CR and elsewhere--who have expressed concerns about the system's performance and safety. CR plans to independently test the software update, popularly known as FSD beta 9, as soon as our Model Y SUV receives the necessary software update from Tesla. So far, our experts have watched videos posted on social media of other drivers trying it out and are concerned with what they're seeing--including vehicles missing turns, scraping against bushes, and heading toward parked cars. Even Tesla CEO Elon Musk urged that drivers use caution when using FSD beta 9, writing on Twitter that "there will be unknown issues, so please be paranoid." FSD beta 9 is a prototype of what the automaker calls its "Full Self-Driving" feature, which, despite its name, does not yet make a Tesla fully self-driving.
- Automobiles & Trucks > Manufacturer (0.39)
- Transportation > Ground > Road (0.36)
Automakers propose policy changes to speed self-driving vehicle roll-out
Major automakers unveiled more than a dozen policy proposals Wednesday they say would make it easier to roll out autonomous vehicles on a large scale in the coming years. The Alliance for Automotive Innovation -- which represents most major automakers including General Motors Co., Ford Motor Co. and Fiat Chrysler Automobiles NV -- among other things asked federal policymakers to create a new vehicle class for AVs, and asked state policymakers to harmonize their policies to make it easier for automakers to test and deploy AVs across different states. "We are releasing this policy roadmap now because we are at a critical time in the development of this technology. Companies have invested billions of dollars into the research and development of this technology and those investments are paying off," said John Bozzella, president of the Alliance. "As our companies start to make plans and critical decisions about where and how and when to build and deploy these technologies, they need to know that policies are in place here in the U.S. that will support those plans and those decisions."
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (1.00)
Collision course: pedestrian deaths are rising – and driverless cars aren't likely to change that
In 2010, the small community of specialists who pay attention to US road safety statistics picked up the first signs of a troubling trend: more and more pedestrians were being killed on American roads. That year, 4,302 American pedestrians died, an increase of almost 5% from 2009. The tally has increased almost every year since, with particularly sharp spikes in 2015 and 2016. Last year, 41% more US pedestrians were killed than in 2008. During this same period, overall non-pedestrian road fatalities moved in the opposite direction, decreasing by more than 7%. For drivers, roads are as safe as they have ever been; for people on foot, roads keep getting deadlier. Through the 90s and 00s, the pedestrian death count had declined almost every year. No one would have confused the US for a walkers' paradise – at least part of the reason fewer pedestrians died in this period was that people were driving more and walking less, which meant that there were fewer opportunities to be struck. But at least the death toll was shrinking. The fact that, globally, pedestrian fatalities were much more common in poorer countries made it possible to view pedestrian death as part of an unfortunate, but temporary, stage of development: growing pains on the road to modernity, destined to decrease eventually as a matter of course. The US road death statistics of the last decade have blasted a hole in that theory.
- Europe > United Kingdom (0.05)
- North America > United States > New York (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (12 more...)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Information Technology (1.00)
- (2 more...)